Bayes' Theorem is a fundamental concept in probability theory and statistics that describes how to update the probability of a hypothesis based on new evidence. It is expressed mathematically as:
\[ P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)} \]
Suppose a medical test for a disease is 99% accurate (it correctly identifies 99% of people who have the disease and 99% of those who do not), and the disease is present in 1% of the population. If a person tests positive, Bayes’ Theorem gives the actual probability of having the disease by combining the accuracy of the test with the base rate of the disease in the population.
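To make this concrete, here is a minimal Python sketch of the calculation, assuming 99% sensitivity, 99% specificity, and 1% prevalence (the interpretation of “99% accurate” used above):

```python
# Bayes' Theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
sensitivity = 0.99   # P(positive | disease)
specificity = 0.99   # P(negative | no disease)
prevalence = 0.01    # P(disease), the base rate

# Denominator via the law of total probability
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"P(disease | positive) = {p_disease_given_positive:.3f}")  # 0.500
```

Even with a 99% accurate test, the posterior probability is only about 50%, because the disease is rare.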
Two events are independent if the occurrence of one event does not affect the probability of the other event.
Mathematical Representation:
\[ P(A \cap B) = P(A) \cdot P(B) \]
Example: Tossing a fair coin and rolling a die; whether the coin lands heads has no effect on the probability of rolling a 6, so \( P(\text{heads} \cap 6) = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12} \).
Two events are dependent if the occurrence of one event affects the probability of the other.
Mathematical Representation:
\[ P(A \cap B) = P(A) \cdot P(B | A) \]
Example: Drawing two cards from a deck without replacement; the probability that the second card is an ace depends on whether the first card drawn was an ace.
Two events A and B are conditionally independent given a third event C if knowing C makes A and B independent.
Mathematical Representation:
\[ P(A \cap B | C) = P(A | C) \cdot P(B | C) \]
Example: In the general population, a child's height and vocabulary size are correlated, but given the child's age (C), they are approximately independent.
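A small Python sketch with hypothetical probabilities illustrates the idea: A and B are each driven by C, so their joint probability factorizes given C, yet marginally they are dependent:

```python
# Hypothetical distribution: A and B each depend on C, and are
# independent of each other once C is known.
p_c = 0.5                          # P(C = 1)
p_a_given_c = {1: 0.8, 0: 0.2}     # P(A = 1 | C)
p_b_given_c = {1: 0.6, 0: 0.3}     # P(B = 1 | C)

# Conditionally: P(A ∩ B | C) = P(A | C) * P(B | C) by construction.
# Marginally, A and B are NOT independent:
p_a = p_c * p_a_given_c[1] + (1 - p_c) * p_a_given_c[0]          # P(A = 1)
p_b = p_c * p_b_given_c[1] + (1 - p_c) * p_b_given_c[0]          # P(B = 1)
p_ab = p_c * p_a_given_c[1] * p_b_given_c[1] + \
       (1 - p_c) * p_a_given_c[0] * p_b_given_c[0]               # P(A = 1, B = 1)

print(f"P(A)P(B) = {p_a * p_b:.3f}, P(A ∩ B) = {p_ab:.3f}")      # 0.225 vs 0.270
```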
Two events A and B are conditionally dependent given C if knowing C affects the dependency between A and B.
Example: A burglary (A) and an earthquake (B) may be independent, but given that a shared alarm went off (C), learning that an earthquake occurred lowers the probability of a burglary, so A and B become dependent.
Two events are mutually exclusive if they cannot happen at the same time.
Mathematical Representation:
\[ P(A \cap B) = 0 \]
Example: Rolling a 2 and rolling a 5 on a single die roll; both cannot occur on the same roll.
Two events are mutually inclusive if they can happen at the same time.
Mathematical Representation:
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]
Example: Drawing a heart and drawing a king from a deck; the king of hearts satisfies both, so the events can occur together.
The probability that both events A and B occur together:
For Independent Events:
\[ P(A \cap B) = P(A) \cdot P(B) \]
For Dependent Events:
\[ P(A \cap B) = P(A) \cdot P(B | A) \]
Example: Drawing two aces in a row without replacement: \( P = \frac{4}{52} \cdot \frac{3}{51} \approx 0.0045 \).
The probability that either event A or event B (or both) occurs:
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]
For Mutually Exclusive Events:
\[ P(A \cup B) = P(A) + P(B) \]
Example: Rolling an even number or a number greater than 3 on a single die: \( P = \frac{3}{6} + \frac{3}{6} - \frac{2}{6} = \frac{4}{6} \approx 0.667 \).
The intersection of two events \( A \) and \( B \), denoted as \( A \cap B \), represents the probability that both events occur simultaneously.
For Independent Events:
\[ P(A \cap B) = P(A) \cdot P(B) \]
For Dependent Events:
\[ P(A \cap B) = P(A) \cdot P(B | A) \]
The union of two events \( A \) and \( B \), denoted as \( A \cup B \), represents the probability that either event occurs (or both).
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]
For Mutually Exclusive Events: (Events that cannot happen together, e.g., rolling a 2 or a 5 on a single die roll)
\[ P(A \cup B) = P(A) + P(B) \]
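These identities can be checked mechanically; the following Python sketch enumerates events from a single fair die roll and verifies the union rule for both overlapping and mutually exclusive events:

```python
from fractions import Fraction

p = Fraction(1, 6)                         # probability of each face of a fair die

def prob(event):
    """Probability of an event given as a set of die outcomes."""
    return p * len(event)

A = {2, 4, 6}            # roll an even number
B = {4, 5, 6}            # roll a number greater than 3
C = {2}                  # roll a 2
D = {5}                  # roll a 5 (C and D are mutually exclusive)

# General union rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)   # 2/3

# Mutually exclusive events: P(C ∩ D) = 0, so P(C ∪ D) = P(C) + P(D)
assert prob(C & D) == 0
assert prob(C | D) == prob(C) + prob(D)                 # 1/3
```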
Naïve Bayes is based on Bayes’ Theorem:
\[ P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)} \]
Since \( P(X) \) is constant for all classes, we can simplify the decision rule:
\[ P(Y | X) \propto P(Y) P(X | Y) \]
The algorithm assumes that features are independent given the class:
\[ P(X | Y) = \prod_{i=1}^{n} P(X_i | Y) \]
Thus, classification is based on:
\[ P(Y | X) \propto P(Y) \prod_{i=1}^{n} P(X_i | Y) \]
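A minimal sketch of this decision rule in Python, assuming the priors \( P(Y) \) and conditional probabilities \( P(X_i | Y) \) have already been estimated (the dictionary layout is just one possible choice); it works in log space, for the numerical reasons discussed below:

```python
import math

def classify(features, priors, likelihoods):
    """Pick the class maximizing P(Y) * prod_i P(X_i | Y), computed in log space.

    priors:      dict mapping class -> P(Y)
    likelihoods: dict mapping class -> dict mapping feature -> P(X_i | Y)
    """
    best_class, best_score = None, float("-inf")
    for y, prior in priors.items():
        score = math.log(prior)
        for x in features:
            score += math.log(likelihoods[y][x])
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Usage with the spam-example probabilities worked out later in this section
priors = {"spam": 0.4, "not_spam": 0.6}
likelihoods = {"spam":     {"Free": 0.114,  "Win": 0.1},
               "not_spam": {"Free": 0.0375, "Win": 0.025}}
print(classify(["Free", "Win"], priors, likelihoods))   # 'spam'
```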
To avoid zero probabilities for unseen words/features, we use Laplace smoothing:
\[ P(X_i | Y) = \frac{\text{Count}(X_i, Y) + 1}{\sum \text{Count}(X, Y) + V} \]
where \( \text{Count}(X_i, Y) \) is the number of times feature \( X_i \) occurs in class \( Y \), \( \sum \text{Count}(X, Y) \) is the total count of all features in class \( Y \), and \( V \) is the vocabulary size (the number of distinct features).
Since multiplying small probabilities can lead to numerical underflow, we take the logarithm:
\[ \log P(Y | X) = \log P(Y) + \sum_{i=1}^{n} \log P(X_i | Y) \]
This ensures that extremely small probabilities do not become zero.
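A quick illustration of the underflow problem, using arbitrary small probabilities:

```python
import math

probs = [1e-4] * 100          # 100 features, each with probability 1e-4

product = 1.0
for p in probs:
    product *= p              # the true value 1e-400 is below the smallest
print(product)                # positive float, so this underflows to 0.0

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                # ≈ -921.03, perfectly representable
```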
Suppose we classify an email as spam (\( S \)) or not spam (\( \neg S \)) based on words “Free” and “Win.”
Given the following data:
- \( \text{Count}(\text{"Free"}, S) = 7 \), \( \text{Count}(\text{"Win"}, S) = 6 \), with 20 total words in spam emails
- \( \text{Count}(\text{"Free"}, \neg S) = 2 \), \( \text{Count}(\text{"Win"}, \neg S) = 1 \), with 30 total words in non-spam emails
- Vocabulary size \( V = 50 \)
- Priors: \( P(S) = 0.4 \), \( P(\neg S) = 0.6 \)
Applying Laplace Smoothing:
\[ P(\text{"Free"} | S) = \frac{7 + 1}{20 + 50} = \frac{8}{70} = 0.114 \]
\[ P(\text{"Win"} | S) = \frac{6 + 1}{20 + 50} = \frac{7}{70} = 0.1 \]
\[ P(\text{"Free"} | \neg S) = \frac{2 + 1}{30 + 50} = \frac{3}{80} = 0.0375 \]
\[ P(\text{"Win"} | \neg S) = \frac{1 + 1}{30 + 50} = \frac{2}{80} = 0.025 \]
Instead of multiplying probabilities, we take the log:
\[ \log P(S | X) = \log P(S) + \log P(\text{"Free"} | S) + \log P(\text{"Win"} | S) \]
\[ = \log(0.4) + \log(0.114) + \log(0.1) \approx -5.39 \]
\[ \log P(\neg S | X) = \log P(\neg S) + \log P(\text{"Free"} | \neg S) + \log P(\text{"Win"} | \neg S) \]
\[ = \log(0.6) + \log(0.0375) + \log(0.025) \approx -7.48 \]
Comparing the log probabilities (computed here with natural logarithms), the class with the highest value is chosen: since \( -5.39 > -7.48 \), the email is classified as spam.
Instead of comparing probabilities, we compare log probabilities:
\[ \log P(S | X) > \log P(\neg S | X) \Rightarrow \text{Classify as Spam} \]
\[ \log P(S | X) < \log P(\neg S | X) \Rightarrow \text{Classify as Not Spam} \]
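Putting the worked example into code (a sketch using the counts and priors listed above):

```python
import math

V = 50                                              # vocabulary size
counts = {"spam": {"Free": 7, "Win": 6}, "not_spam": {"Free": 2, "Win": 1}}
totals = {"spam": 20, "not_spam": 30}               # total word counts per class
priors = {"spam": 0.4, "not_spam": 0.6}

def log_posterior(label, words):
    """log P(Y) + sum_i log P(word_i | Y), with Laplace smoothing."""
    score = math.log(priors[label])
    for w in words:
        p = (counts[label].get(w, 0) + 1) / (totals[label] + V)
        score += math.log(p)
    return score

words = ["Free", "Win"]
scores = {label: log_posterior(label, words) for label in priors}
print(scores)                          # spam ≈ -5.39, not_spam ≈ -7.48
print(max(scores, key=scores.get))     # 'spam'
```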
Naïve Bayes with Laplace Smoothing and Log Probabilities is a powerful classification method, especially in text classification problems like spam detection.
The bias-variance tradeoff is a fundamental concept in machine learning that describes the balance between two sources of error: bias, the error introduced by overly simple modeling assumptions, and variance, the error introduced by sensitivity to the particular training data.
Ideally, a model should find the right balance between bias and variance to achieve low overall error.
Naïve Bayes assumes that all features are conditionally independent given the class label. This assumption is often unrealistic in real-world data.
Effect of High Bias: The model may underfit, systematically missing real relationships between features (e.g., correlated words in text), and these errors do not disappear even with more training data.
Variance refers to how much the model's predictions change when trained on different subsets of data.
Effect of Low Variance: Because Naïve Bayes estimates only simple per-class and per-feature probabilities, its predictions change little across different training samples, so it remains stable even on small datasets.
Naïve Bayes generally has **high bias but low variance**: the independence assumption is a strong simplification (high bias), while the small number of estimated parameters keeps its predictions stable across different training sets (low variance).
Implications: Naïve Bayes tends to underfit complex feature interactions, but it generalizes consistently, trains quickly, and often performs well when the independence assumption is only mildly violated.
In a typical bias-variance tradeoff curve, total error is plotted against model complexity: bias falls and variance rises as complexity increases, and total error is minimized where the two are balanced.
Since Naïve Bayes has high bias, some strategies to reduce it include: engineering richer or less correlated features, choosing a likelihood model that better fits the data (e.g., Gaussian, multinomial, or Bernoulli), relaxing the independence assumption with semi-naïve variants, and tuning the smoothing parameter.
Naïve Bayes is a simple and efficient classifier that manages the bias-variance tradeoff by favoring high bias and low variance.
It works well in applications like spam detection, sentiment analysis, and text classification, but may struggle when feature dependencies are significant.
Naïve Bayes is considered an interpretable model because its predictions are driven by class priors and per-feature conditional probabilities that can be inspected directly, and its decision rule is a transparent sum of log-probability terms.
Unlike models like decision trees or linear regression, Naïve Bayes does not provide explicit feature importance scores. However, we can estimate feature importance by examining the learned conditional probabilities \( P(X_i | Y) \) and how strongly they differ across classes.
Naïve Bayes calculates the probability of a class as:
\[ P(Y | X_1, X_2, ..., X_n) \propto P(Y) \prod_{i=1}^{n} P(X_i | Y) \]
Taking the log probability for numerical stability:
\[ \log P(Y | X_1, ..., X_n) = \log P(Y) + \sum_{i=1}^{n} \log P(X_i | Y) \]
From this equation, we can analyze which features contribute the most by looking at the magnitude of **\( \log P(X_i | Y) \)** values.
There are different ways to measure feature importance in Naïve Bayes: for example, ranking features by the magnitude of \( \log P(X_i | Y) \) within a class, or by the log-ratio \( \log \frac{P(X_i | Y = c_1)}{P(X_i | Y = c_2)} \), which highlights features that discriminate strongly between classes.
In spam detection, words such as "free" and "offer" typically have much higher conditional probabilities in spam emails than in non-spam emails, which makes them important features for classification.
Naïve Bayes is an interpretable model where feature importance can be estimated using the magnitude of the learned log conditional probabilities and the ratio of conditional probabilities across classes.
While it does not explicitly compute feature importance like tree-based models, its **simple and transparent structure** makes it a useful tool for understanding feature contributions.
In an imbalanced dataset, one class has significantly more instances than another. Naïve Bayes assumes equal prior probabilities unless explicitly corrected, which can lead to biased predictions.
Instead of assuming uniform class probabilities \( P(Y) \), we can set class priors based on observed class frequencies:
\[ P(Y = c) = \frac{\text{count}(Y=c)}{\text{total samples}} \]
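A minimal sketch of this estimate (the label array is a placeholder for a real training set):

```python
import numpy as np

# Hypothetical labels from an imbalanced dataset: 950 negatives, 50 positives
y = np.array([0] * 950 + [1] * 50)

classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()                 # P(Y = c) = count(Y = c) / total
for c, p in zip(classes, priors):
    print(f"P(Y = {c}) = {p:.2f}")             # P(Y = 0) = 0.95, P(Y = 1) = 0.05
```

scikit-learn's MultinomialNB does this by default (fit_prior=True), and fixed priors can instead be supplied via its class_prior parameter.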
Accuracy is unreliable in imbalanced datasets. Instead, use metrics such as precision, recall, F1-score, and ROC-AUC, which better reflect performance on the minority class.
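For example, using scikit-learn's metrics on hypothetical predictions for an imbalanced problem:

```python
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions for an imbalanced problem
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Per-class precision, recall, and F1 -- far more informative than accuracy here
print(classification_report(y_true, y_pred, digits=3))
```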
Naïve Bayes relies on probability estimates, which can be significantly affected by outliers.
Laplace smoothing prevents zero probabilities due to rare words or extreme values:
\[ P(X_i | Y) = \frac{\text{count}(X_i, Y) + \alpha}{\text{count}(Y) + \alpha \cdot |V|} \]
where \( \alpha \) (usually 1) is the smoothing parameter and \( |V| \) is the number of distinct feature values; the added pseudo-counts prevent zero probability estimates for rare or unseen feature values.
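A quick numeric illustration with hypothetical counts (a feature never observed in a class that has 100 total feature occurrences and \( |V| = 50 \)):

```python
def smoothed_prob(count_xi_y, count_y, vocab_size, alpha=1.0):
    """Laplace/Lidstone-smoothed estimate of P(X_i | Y)."""
    return (count_xi_y + alpha) / (count_y + alpha * vocab_size)

# An unseen feature (count 0) no longer gets probability 0:
print(smoothed_prob(0, 100, 50))              # 1/150 ≈ 0.0067
print(smoothed_prob(0, 100, 50, alpha=0.1))   # 0.1/105 ≈ 0.00095
```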
To improve Naïve Bayes on imbalanced data and outliers: set class priors from the observed class frequencies (or resample the training data), evaluate with imbalance-aware metrics instead of accuracy, and rely on Laplace smoothing, together with sensible preprocessing of extreme values, to limit the influence of rare or outlying features.
By addressing these challenges, Naïve Bayes can remain a robust and interpretable model for real-world data.